Abstract

Recent advances in formal verification techniques enabled 
the implementation of distributed systems with machine- 
checked proofs. While results are encouraging, the impor- 
tance of distributed systems warrants a large scale evaluation 
of the results and verification practices. 

This paper thoroughly analyzes three state-of-the-art, for- 
mally verified implementations of distributed systems: Iron- 
Fleet, Verdi, and Chapar. Through code review and testing, 
we found a total of 16 bugs, many of which produce serious 
consequences, including crashing servers, returning incorrect
results to clients, and invalidating verification guarantees.
These bugs were caused by violations of a wide range
of assumptions on which the verified components relied. Our 
results revealed that these assumptions referred to a small 
fraction of the trusted computing base, mostly at the inter- 
face of verified and unverified components. Based on our 
observations, we have built a testing toolkit called PK, which 
focuses on testing these parts and is able to automate the de- 
tection of 13 (out of 16) bugs. 

1. Introduction 

Distributed systems, complex and difficult to implement cor- 
rectly, are notably prone to bugs. This is partially because 
developers find it challenging to reason about the combination
of concurrency and failure scenarios. As a result, distributed
systems bugs pose a serious problem for both service providers
and end users, and have caused critical service interruptions
and data losses [58]. The struggle to improve their reliability
has spawned several important lines of research, such as
programming abstractions [5, 38, 46], bug-finding tools
[27, 39, 55, 56], and formal verification techniques
[23, 30, 36, 54].
Figure 1: An overview of the workflow to verify a distributed
system implementation. 

Formal verification, in particular, offers an appealing ap- 
proach because it provides a strong correctness guarantee 
of the absence of bugs under certain assumptions. Over the 
last few decades, the dramatic advances in formal verifica- 
tion techniques have allowed these techniques to scale to 
complex systems. They were successfully applied to build 
large single-node implementations, such as the seL4 OS kernel
[28] and the CompCert compiler [35]. More recently,
they enabled the verification of complex implementations of 
distributed protocols, including IronFleet [23], Verdi [54], 
and Chapar [36], which are known to be non-trivial to im- 
plement correctly. 

At a high level, verifying these distributed system imple- 
mentations follows the workflow shown in Figure 1. First, 
developers describe the desired behavior of the system in a 
high-level specification, which is often manually reviewed 
and trusted to be correct. Developers also need to model 
the primitives, such as system calls provided by the OS, on
which the implementation relies; we refer to this as the
shim layer. Finally, developers invoke auxiliary tools (e.g., 
scripts) to communicate with a verifier and print results. The 
specification, the shim layer, and auxiliary tools, as well as 
the components they glue together, are part of the trusted 
computing base (TCB). If the verification check passes, it 
guarantees the correctness of the implementation, assuming 
the TCB is correct. 
(*) IronFleet’s specification does not guarantee exactly-once semantics. See §5.
Figure 2: Summary of the verified distributed systems we analyzed. 


This paper conducts the first empirical study on the cor- 
rectness of formally verified implementations of distributed 
systems. While formal verification gives a strong correct- 
ness guarantee under certain assumptions, our overarching 
research goal is to understand the effectiveness of current 
verification practices: what types of bugs occur in the pro- 
cess of verifying a distributed system, and where do they oc- 
cur? We focus on possible bugs in distributed systems; bugs 
in external components, such as the OS, the verifier, or the 
hardware, are beyond the scope of this paper. 

In particular, this paper addresses the following three 
research questions: 

1. How reliable are existing formally verified distributed 
systems and what are the threats to their correctness? 

2. How should we test the assumptions relied upon by veri- 
fication? 

3. How can we move towards real-world, “bug-free” dis- 
tributed systems? 

To answer these questions, we studied three state-of-the- 
art verified distributed systems (Figure 2). We acknowledge 
that these systems, despite having formal correctness proofs,
are research prototypes; our ultimate goal is not to find bugs 
in them, but rather to understand the impact of the assump- 
tions made by formal verification practices when applied to 
building distributed systems. §3 will provide a detailed dis- 
cussion of our methodology. 

Our four main contributions follow. First, and surprisingly, we
found 16 bugs in the verified systems that have a negative
impact on the server correctness or on the verification guar- 
antees. Importantly, analyzing their causes reveals a wide 
range of mismatched assumptions (e.g., assumptions about 
the unverified code, unverified libraries, resources implicitly 
used by verified code, verification infrastructure, and spec- 
ification). This finding suggests that a single testing tech- 
nique would be insufficient to test all the assumptions that 
actually fail in real-world scenarios when building verified 
distributed systems; instead, developers need a similarly di- 
versified testing toolkit. 

Second, we observe that the identified bugs occur at the 
interface between verified and other components, namely 
in the specification, shim layer, and auxiliary tools, rather 
than in the rest of the system (e.g., the OS). These interface 
components typically consist of only a few hundred lines 
of source code, which represent a tiny fraction of the entire 
TCB (e.g., the OS and verifier). However, they capture important
assumptions made by developers about the system; their
correctness is vital to the assurances provided by verification
and to the correct functioning of the system.

Third, none of these bugs were found in the distributed 
protocols of verified systems, even though we specifically
searched for protocol bugs and spent more than eight months 
in this process. This result suggests that these verified dis- 
tributed systems correctly implement the distributed system 
protocols, which is particularly impressive given the noto- 
rious complexity of distributed protocols. The absence of 
protocol bugs found in the verified systems sharply con- 
trasts with the results of an analysis we conducted of known 
bugs in unverified distributed systems. This analysis con- 
firms that even mature, unverified distributed systems suffer 
from many protocol-level bugs. It suggests that these verifi- 
cation techniques are effective in significantly improving the 
reliability of distributed systems. 

Finally, based on the evaluation results and with the goal 
of complementing verification techniques, we built the PK 1 
testing toolchain that detects the majority of the bugs found. 
The toolkit can be generalized to find similar bugs in other 
verified systems. Inspired by the findings of our study, we 
limited testing to the components that were found to be the 
source of bugs for real-world verified systems. In particular, 
our toolchain does not test verified components; it tests only 
the TCB and, additionally, it focuses testing on the interface 
between verified and unverified components. 

2. Background 

This section provides background on verification techniques, 
replicated distributed protocols, and the verified distributed 
systems that we analyzed. 

2.1 Machine-Checked Verification 

An important technique to formally reason about systems 
relies on the programmer writing formal proofs. As opposed 
to pen-and-paper proofs, machine-checked proofs provide 
the assurance that each step in the proof is correct — a key 
factor given that proofs can be extensive and complex. 

Verification provides a formal guarantee that the system 
satisfies a specification, which consists of: (1) a formal de- 
scription of the properties (behavior) that the system must 
satisfy, and (2) the assumptions made about the environment, 
such as the network and file system. 

The specification is a critical concept in verification. The 
specification is important to (a) informally convince devel- 
opers that the system has the properties that they desire and 
(b) formally verify other systems through compositional ver- 
ification techniques. The former application of the specifica- 
tion relies on the fact that the specification is often smaller 
and simpler than the implementation, increasing developer’s 
confidence that it is correct through manual inspection. 

Importantly, all formal guarantees provided by machine-checked
verification are valid as long as the trusted computing
base (TCB) is correct. For verified systems, the TCB
includes: the specification, the verification tools (e.g., verifier,
compiler, build system), and the runtime infrastructure
(e.g., libraries, OS, hardware). With a correct TCB, verification
ensures that the implementation is “bug-free.”

1 PK is an acronym for Panacea Kit

Figure 3: Bugs that our analysis found in the high-level specification, verification tool, and shim layer of verified distributed systems. Some
bugs caused servers to crash or to produce incorrect results, and most bugs are detected by our testing toolchain (PK). We reported all listed
bugs to developers, except bug V6 and bug V7, which the developers had already fixed.

2.2 Replicated Distributed Protocols 

The systems we studied implement replicated distributed 
protocols. IronFleet and Verdi implement replicated state 
machine protocols (MultiPaxos [32] and Raft [47], respec- 
tively), while Chapar implements a replicated key-value 
store that provides causal consistency [2, 40]. This section 
provides background about these protocols. 

Replicated state machine protocols. Replicated state 
machine (RSM) protocols replicate an arbitrary state ma- 
chine over a set of replicas while providing the abstrac- 
tion of a single server running a single state machine. Both 
MultiPaxos and Raft, which are leader-based, aim to pro- 
vide fault-tolerance under the crash-fault model where repli- 
cas and clients communicate over asynchronous networks. 
MultiPaxos and Raft provide linearizable [24] semantics to 
clients — the strongest consistency guarantee [33]. 
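
The state machine abstraction can be illustrated with a minimal sketch. The counter commands below are hypothetical and are not taken from any of the studied systems; the point is only that, once replicas agree on the order of commands, a deterministic transition function makes every replica reach the same state.

```ocaml
(* Hypothetical commands for a replicated counter. *)
type command = Incr | Add of int

(* The transition function must be deterministic: the next state depends
   only on the current state and the command. *)
let step (state : int) (cmd : command) : int =
  match cmd with
  | Incr -> state + 1
  | Add n -> state + n

(* A replica's state is fully determined by the agreed command log. *)
let replay (log : command list) : int = List.fold_left step 0 log
```

Protocols such as MultiPaxos and Raft are responsible for the agreement on the log; the sketch covers only the replay side.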

Causal consistency protocols. Causal consistency, a 
weaker form of consistency, uses the notion of potential 
causality [29]. It imposes fewer restrictions on implemen- 
tation behavior than linearizability, potentially improving 
performance while still providing intuitive semantics. Lloyd 
et al. [40] and Ahamad et al. [2] proposed two different al- 
gorithms for causal consistency. 

2.3 Verified Systems Surveyed 

We survey three state-of-the-art verified distributed systems: 
IronFleet, Verdi, and Chapar. 


IronFleet. IronFleet proposes a methodology to verify
distributed systems that relies on state machine refinement
[1, 20, 30] and Hoare-logic verification [17, 25]. It
provides a verified implementation of a MultiPaxos server 
library (§2.2) and an implementation of a counter that uses 
the library. 2 IronFleet aims at proving the safety (lineariz- 
ability) and liveness of the MultiPaxos library. 

IronFleet is implemented and verified using Dafny. The 
Dafny compiler and verifier [34] rely on a low-level compiler
and verifier (Boogie [4]) and on an SMT solver (Z3 [15]).
Some of IronFleet’s non-verified implementation code is 
written in C#. 

Verdi. Verdi’s methodology to verify distributed systems 
relies on a verified transformer [54]. It verifies a server 
implementation of the Raft protocol (§2.2) 2 and seeks to 
prove its safety properties (linearizability). Verdi provides 
durability (operations are written to disk) and implements 
recovery (replicas can recover from a crash). 

Verdi is verified and implemented using the Coq proof 
assistant [14]. It invokes Coq to translate its verified code 
into OCaml, which is then either interpreted by the OCaml 
interpreter or compiled into a binary by the OCaml compiler. 
Some of Verdi’s non-verified code is written in OCaml. 

Chapar. Chapar proposes a methodology to verify dis- 
tributed systems with causal consistency semantics. It ver- 
ifies a key-value store that implements the two causal con- 
sistency algorithms described in §2.2. Chapar seeks to prove 
the safety properties of both servers and clients. 

Like Verdi, the Chapar server is implemented and veri- 
fied using Coq and OCaml. Chapar also verifies the client 
application using model checking. 


2 IronFleet and Verdi implement additional, simpler protocols. We consider 
these to be secondary and outside the scope of this paper. 
3. Methodology

This section describes the methodology we used in our study 
and discusses some of its limitations. 

3.1 Scope 

Our study analyzed two aspects of each verified distributed 
system: 

1. Overall correctness. We studied the overall correct- 
ness of the server implementation (shim layer and verified 
code). 3 Thus, we did not restrict the analysis to verified com- 
ponents or verified properties. 

2. Verification guarantees. We studied the specification 
and verification tools used to verify the systems. This analy- 
sis allowed us to understand the extent to which formal ver- 
ification guarantees cover properties and components. 

3.2 Analysis Techniques 

We relied on the following methods to analyze the correct- 
ness of the implementations and their formal guarantees. 

Analysis of code and documentation. We analyzed the 
verified systems’ source code and specification. In addition, 
we leveraged existing documentation to understand their 
design. We identified assumptions the systems made and 
formulated hypotheses about missing or incorrect functions 
that could constitute bugs. 

Testing of implementation. We tested the implementa- 
tions using a network and file system fuzzer, and we devel- 
oped test cases to check the correctness of different compo- 
nents. Furthermore, we applied traditional debugging tech- 
niques, such as debuggers and packet sniffers, to gain a bet- 
ter understanding of the implementations and to confirm or 
rebut our hypotheses throughout our study. We incorporated 
the testing tools we developed into our PK toolchain (§3.3). 

Comparison of systems and interaction with devel- 
opers. We cross-checked the different verified systems by 
checking whether bugs found in one such system also ex- 
isted in the others. In addition, we checked whether bugs 
found in non-verified systems existed in the verified systems 
analyzed as well. We reported to developers of the verified 
systems the bugs we found using their issue trackers. 

3.3 PK Testing Toolchain 

Our analysis of the verified systems identified a series of 
specific bugs that impaired their overall correctness or veri- 
fication guarantees. Importantly, these bug examples, which 
were gathered using the methods explained in §3.2, enabled 
us to develop the PK testing toolchain that systematizes the 
search for similar bugs. Figure 4 provides an overview of 
prescribed approaches to improve the reliability of verified 
systems. We adopted some of these approaches in our testing 
toolchain (§4.5, §5.2, and §6.2). 


3 We focus on servers because they typically contain most of the complexity 
in distributed systems. 


Shim layer
  Verifying additional components of the system (§4.4, §4.5, §8.2)
  Verifying resource usage and liveness properties (§4.4)
  Improving documentation of libraries (§4.4)
  Testing focused on the shim layer (§4.5)
  Testing implicit resource usage (§4.5)

Specification
  Proving specification properties (§5.2)
  Verifying applications using specifications of underlying layers (§5.2)
  Testing specifications (§5.2)

Verification tool
  Designing fail-safe verifiers (§6.2)
  Testing verifiers (§6.2)

Figure 4: An overview of prescribed approaches to improve the
overall reliability of verified software.

3.4 Study Limitations 

We focused our efforts on analyzing the source code of the 
implementations and the specification of the verified sys- 
tems. Therefore, it is possible that our results could under- 
represent bugs in other parts of the TCB. 

As with other bug studies, it can be difficult to reason 
about the number of false negatives, bugs that may exist but 
were not found. This applies to bug studies that rely on bugs 
reported by users [18, 42, 52], bugs found by testing tools [6, 
7, 13, 19], and bugs found through a mix of testing tools 
and manual inspection, like ours. Despite this challenge, we 
aimed to be systematic using two separate means. First, we 
cross-checked the bugs found in each verified system against 
the other verified systems. Second, we analyzed the verified 
systems by iteratively formulating and checking hypotheses, 
which included hypotheses based on bugs that were found in 
other non-verified distributed systems, such as those found 
by Scott et al. [49]. 

Regarding false positives, on the other hand, the fact that 
we reported the bugs and that nearly all of them were already
either fixed or confirmed by the developers provided a
high degree of confidence that these were indeed bugs. Our
study analyzed a relatively small number of verified systems 
(3) and found a limited number of bugs (16); small values 
necessarily require care in generalizing results. Neverthe- 
less, our study analyzed the largest number of verified im- 
plementations and the largest number of bugs, relative to all 
previous studies of which we are aware [57]. 

4. Shim Layer Bugs 

We classified the shim layer bugs into three categories: (1) 
RPC implementation, (2) disk operations, and (3) resource 
limitations. All these bugs resulted from a discrepancy be- 
tween the shim implementation and the expectation (also 
known as low-level specification) that the verified compo- 
nent held regarding the shim implementation. 

4.1 RPC Implementation 

We found five shim layer bugs affecting the client-server and 
server-server communication. These bugs were all caused by 
// Client A, B and C run on different servers
1: Client A: PUT("key", "NA")
2: Client A: PUT("key", "Request")
3: Client B: GET("key") = "Request"
4: Client B: PUT("key-effect", "Reply")
// Packet sent by request 1 is duplicated
5: Client C: GET("key-effect") = "Reply"
6: Client C: GET("key") = "NA"

Figure 5: Test case that violates causal consistency (Bug C1). Client
C reads the effect event ("Reply") but not the cause event
("Request") that Client B read.

mismatched assumptions about the network semantics, with 
respect to its failure model and limits, or the RPC input, with 
respect to its size and contents. 

Bug V1: Incorrect unmarshaling of client requests throws
exceptions.

The Verdi server used TCP to receive client requests but 
wrongly assumed that the recv system call would return 
the entire request in a single invocation. Because of this 
assumption, the server tried to unmarshal a partial client 
request, which caused it to throw an exception. The fix for 
this problem is to accumulate the received data in a buffer 
until the complete request is received. 
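
The buffering fix can be sketched as follows. This is a minimal illustration, not Verdi's code: it assumes that requests are newline-terminated, and the name `make_reassembler` is hypothetical. Each chunk stands for the data returned by one recv call.

```ocaml
(* Sketch only: the newline delimiter and all names are assumptions. *)

(* Create a reassembler. The returned function takes one chunk, as
   returned by a single recv call, and yields every request that is now
   complete, in arrival order. Incomplete data is kept for later. *)
let make_reassembler () : string -> string list =
  let pending = Buffer.create 4096 in
  fun chunk ->
    Buffer.add_string pending chunk;
    let data = Buffer.contents pending in
    Buffer.clear pending;
    let rec split start acc =
      match String.index_from_opt data start '\n' with
      | Some stop ->
          split (stop + 1) (String.sub data start (stop - start) :: acc)
      | None ->
          (* No delimiter left: keep the partial tail for the next call. *)
          Buffer.add_string pending
            (String.sub data start (String.length data - start));
          List.rev acc
    in
    split 0 []
```

A server loop would call the returned function after every recv and unmarshal only the strings it yields, so a partial request is never handed to the unmarshaling code.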

Bug C1: Duplicate packets cause consistency violation.

We found that Chapar servers, which sent updates to 
each other through UDP, accepted duplicate packets and 
always reapplied the updates. As a result, we were able to 
construct a test case (Figure 5) that caused a client to see 
results that violate causal consistency [29]. Client C is able
to see the effect of an event ("Reply") by reading "key- 
effect" but is unable to see the cause ("Request") by reading 
"key". In addition to violating causal consistency, accepting 
duplicate updates that are interleaved with other updates 
to the same key prevented monotonic reads, an important 
session guarantee property [53]. 

This bug resulted from the assumptions of Lloyd’s algo- 
rithms [40] that were implemented by the server, which as- 
sume a reliable network. The updates sent between servers 
(reflecting the PUT operations from clients) contained a de- 
pendency vector that used Lamport clocks. Before applying 
a new update, the server checked that all causally preceding 
updates had already been applied. Unfortunately, given the 
use of UDP, this check is not enough in an unreliable net- 
work. An old update would satisfy this check and be reap- 
plied, but applying it could overwrite later updates. 
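
One way to make update application idempotent is to remember which updates have already been applied. The sketch below is an assumption-laden illustration rather than Chapar's implementation: the record fields, the `(origin, clock)` identifier, and the two tables are hypothetical, and the causal dependency check described above is omitted because it is orthogonal to duplicate suppression.

```ocaml
(* Hypothetical update format: an update is identified by the server that
   originated it and that server's logical clock value. *)
type update = { origin : int; clock : int; key : string; value : string }

(* Identifiers of updates that were already applied. *)
let applied : (int * int, unit) Hashtbl.t = Hashtbl.create 64

(* The key-value store itself. *)
let store : (string, string) Hashtbl.t = Hashtbl.create 64

(* Apply an update at most once. Returns false, and leaves the store
   untouched, when the update is a duplicate. *)
let apply_once (u : update) : bool =
  let id = (u.origin, u.clock) in
  if Hashtbl.mem applied id then false
  else begin
    Hashtbl.replace applied id ();
    Hashtbl.replace store u.key u.value;
    true
  end
```

The set of applied identifiers grows without bound in this sketch; a real implementation would likely compact it, for example by keeping a per-origin high-water mark when updates from one origin are applied in order.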

Bug C2: Dropped packets cause liveness problems.

The server shim layer implementation did not handle 
packet drops because it assumed that the network layer was 
reliable. Given that the server relied on UDP sockets to ex- 
change updates, the practical result is that a single packet 
drop prevented clients on the receiver server from ever see- 
ing the respective update. This problem was made worse be- 


// Initialize
PUT("key1", "value1")
PUT("key2", "value2")
PUT("key3", "value3")

// Inject GET("key2") request
GET("key1 - - \n1 32201 621 216857 GET key2") = "value1"

// GET requests return wrong values
GET("key1") = "value2" // Wrong value
GET("key2") = "value1" // Wrong value
GET("key3") = "value2" // Wrong value

Figure 6: Command injection vulnerability (Bug V2).

cause a single dropped packet could also prevent the clients 
from reading subsequent updates: the causal dependency 
check would prevent subsequent requests from being ap- 
plied. The fix to this problem would be to implement re- 
transmission and acknowledgment mechanisms. 
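
The sender-side bookkeeping that such a mechanism needs can be sketched as a small model. Everything here is hypothetical (the names, the 0.5 second timeout, and the explicit `now` argument that stands in for a clock); the sketch does not perform any network I/O.

```ocaml
(* Sender-side state for at-least-once delivery over a lossy channel. *)
type pending = { seq : int; payload : string; mutable sent_at : float }

let timeout = 0.5 (* seconds; arbitrary value for this sketch *)
let next_seq = ref 0
let unacked : (int, pending) Hashtbl.t = Hashtbl.create 16

(* Record a message as sent at time [now]; returns its sequence number. *)
let send ~now payload =
  let seq = !next_seq in
  incr next_seq;
  Hashtbl.replace unacked seq { seq; payload; sent_at = now };
  seq

(* The receiver acknowledged [seq]: stop retransmitting it. *)
let ack seq = Hashtbl.remove unacked seq

(* Messages whose acknowledgment is overdue at time [now], ordered by
   sequence number. Each one is marked as resent at [now]. *)
let due ~now =
  Hashtbl.fold
    (fun _ p acc ->
      if now -. p.sent_at >= timeout then begin
        p.sent_at <- now;
        p :: acc
      end
      else acc)
    unacked []
  |> List.sort (fun a b -> compare a.seq b.seq)
```

Retransmission means the receiver can see the same update more than once, so it only helps if the receiver also discards duplicates.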

Bug V2: Incorrect marshaling enables command injection. 

Figure 6 shows a sequence of requests that, when exe- 
cuted, allowed the client to cause the server to execute mul- 
tiple commands by invoking a single command. As the ex- 
ample shows, this bug also causes subsequent requests to 
return incorrect results. 

This bug resulted from incorrect marshaling; the server 
used meta-characters (newlines and spaces) to distinguish 
commands and command arguments but it did not escape the 
meta-characters. As a result, if the client invoked a command 
request with specially crafted arguments, it caused the RSM 
library to interpret that invocation as two or more distinct 
requests. In addition, after injecting commands, subsequent 
requests returned incorrect results because the client-server 
protocol expected each invocation to be followed by exactly 
one response (and had no other way to pair requests with
responses). Further, due to this bug, some arguments crashed
the server because they led to messages that did not comply 
with the format expected by the server, causing the marshal- 
ing function to throw an uncaught exception. 

The fix to this problem would be to change, at both ends, 
the client-server communication protocol to ensure that any 
argument value can be sent by the client. This could be 
achieved either by escaping meta-characters or by adopting 
a length-prefix message format. 
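
A length-prefix format can be sketched as follows. The encoding used here, a decimal length followed by a colon and the raw bytes, is an assumption chosen for illustration and is not the format adopted by the Verdi developers; `encode` and `decode` are hypothetical names.

```ocaml
(* Encode each field as "<decimal length>:<bytes>", so that spaces and
   newlines inside a field are never interpreted as separators. *)
let encode (fields : string list) : string =
  String.concat ""
    (List.map (fun f -> Printf.sprintf "%d:%s" (String.length f) f) fields)

(* Decode a complete message. Returns None if the message is malformed.
   Note: int_of_string_opt also accepts forms such as "0x10"; a stricter
   parser would allow decimal digits only. *)
let decode (msg : string) : string list option =
  let n = String.length msg in
  let rec go pos acc =
    if pos = n then Some (List.rev acc)
    else
      match String.index_from_opt msg pos ':' with
      | None -> None
      | Some colon -> (
          match int_of_string_opt (String.sub msg pos (colon - pos)) with
          | Some len when len >= 0 && len <= n - colon - 1 ->
              go (colon + 1 + len) (String.sub msg (colon + 1) len :: acc)
          | _ -> None)
  in
  go 0 []
```

With this encoding, an argument that contains a newline followed by another command is decoded as one opaque field. On a stream transport the whole message would need a length prefix as well, so the receiver knows where one request ends.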

Bug C3: Library semantics causes safety violations.

Our tests demonstrated that, under certain conditions,
the server could send corrupted packets to other servers
containing command arguments that were not provided by 
the client. This bug violated an expected safety property 
(namely, integrity) because the corrupted packets were ap- 
plied by the servers to their storage and were made visible 
to other users. 

Interestingly, this bug resulted from the semantics of 
the OCaml library function Marshal.to_channel(), which
Chapar uses to marshal messages sent between servers. It 
Figure 7: Exception while invoking the marshaling function left
data in the internal buffers of the channel (Bug C3). 

was triggered if the server tried to marshal and send a mes- 
sage that did not fit within the UDP packet limit. When this 
occurred, sending the packet would fail; more importantly, 
the headers and partial data would be kept in internal buffers 
of the OCaml library. In other words, the prefix of the mes- 
sage was internally buffered while the suffix was discarded. 
Our test case demonstrated that the buffered data, in turn, 
could be concatenated with subsequent requests to construct 
a packet that had the correct format but incorrect content. 

As Figure 7 shows, the function Marshal.to_channel()
broke down an OCaml object and serialized its subcompo- 
nents (which are the elements of a list in Chapar’s case). 
After converting each subcomponent into a byte representa- 
tion, the marshaling function invoked the channel write func- 
tion, caml_put_block(). OCaml channels can have differ- 
ent types (e.g., UDP socket, TCP socket, or files). However, 
the channel write function internally buffered the writes and 
only wrote to the device (e.g., socket) if: (1) the buffer limit 
had been reached during the write, or (2) the developer ex- 
plicitly flushed the buffer. If the channel attempted to write 
the buffer contents and failed, it would return, leaving the 
contents as they were. The error returned from the chan- 
nel layer was caught by the marshaling function, which then 
returned to the code that invoked it; importantly, however, 
the prefix of the byte representation was left in the channel 
buffer and could be sent later due to other invocations. 

Our example test case is complex, but we created sim- 
pler test cases that caused other problems. Besides causing 
servers to accept requests that no client issued, this bug can 
cause: requests to be silently discarded; large requests to not 
be sent when they exceed the UDP limit; and small requests 
to not be accepted by the receiver when they are concate- 
nated with large requests. 

In addition to reporting this bug to Chapar developers, we 
also reported it to OCaml developers. We did so because the 
current semantics of the OCaml library are difficult to use 
correctly, and this problem is not mentioned in the library 
function documentation. The OCaml developers confirmed 


the problem and, in response to our bug report, have been 
actively discussing possible workarounds and fixes. 

One workaround discussed by OCaml developers is the 
following: (1) marshal the OCaml object into an external 
buffer, (2) check whether the length of the buffer is smaller 
than the UDP size, and (3) if it is smaller, then manually 
invoke the channel write function on the external
buffer. Unfortunately, this workaround does not solve all the
problems because of other types of errors that could prevent
the write calls from succeeding (like those we discuss below
in Bug V5). In practice, this bug may have a significant
impact on verified systems reliability because of the number 
of such systems that use the OCaml library and may suffer 
from similar problems. 
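
Steps (1) and (2) of the workaround can be sketched with `Marshal.to_bytes`, which serializes into a fresh byte sequence instead of a channel. The function name `marshal_for_udp` is hypothetical, and the limit of 65,507 bytes assumes UDP over IPv4 without IP options (65,535 minus a 20-byte IP header and an 8-byte UDP header).

```ocaml
(* Largest UDP payload over IPv4 without IP options (an assumption about
   the deployment): 65535 - 20 (IP header) - 8 (UDP header). *)
let max_udp_payload = 65_507

(* Serialize [v] outside of any channel, so that an oversized message
   leaves no partial data behind in a channel buffer. On success, the
   result is meant to be passed to a single send call. *)
let marshal_for_udp (v : 'a) : (bytes, string) result =
  let buf = Marshal.to_bytes v [] in
  let len = Bytes.length buf in
  if len <= max_udp_payload then Ok buf
  else Error (Printf.sprintf "message of %d bytes exceeds UDP limit" len)
```

As noted above, the size check alone is not sufficient: the caller still has to handle errors from the send call itself.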


4.2 Disk Operations 

We found three shim layer bugs related to disk operations. 
All these problems were caused by developers assuming that
a single disk operation, or a set of them, is atomic across crashes.

Bug V3: Incomplete log causes crash during recovery.

Bug V3 prevented replicas from recovering by causing
them to repeatedly crash if the (disk) log was truncated.
This problem was caused by the server code wrongly assum- 
ing that the entries in the log were always complete. This was 
not the case; the server used the write system call, which
does not guarantee atomic writes when a server crashes.

In a deployment situation, an administrator could over- 
come this problem by manually discarding the incomplete 
entry at the end of the log. This would be safe (i.e., not cause 
loss of data), assuming the rest of the server logic were cor- 
rect because it would be equivalent to a server crash that oc- 
curs immediately before the write operation. Nevertheless, 
restoring service would require intervention from the admin- 
istrator and a correct diagnosis. 
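
The manual step could in principle be automated during recovery. The sketch below assumes a hypothetical log format, in which each entry is a 4-byte big-endian length followed by the payload; Verdi's actual on-disk format is not described here, so this only illustrates the idea of dropping an incomplete tail.

```ocaml
(* Read a 4-byte big-endian length stored at [pos]. *)
let read_len (log : string) (pos : int) : int =
  let b i = Char.code log.[pos + i] in
  (b 0 lsl 24) lor (b 1 lsl 16) lor (b 2 lsl 8) lor b 3

(* Return the complete entries, plus the number of trailing bytes that
   belong to an incomplete entry and can be discarded. *)
let recover_log (log : string) : string list * int =
  let n = String.length log in
  let rec go pos acc =
    if n - pos < 4 then (List.rev acc, n - pos)
    else
      let len = read_len log pos in
      if len < 0 || len > n - pos - 4 then (List.rev acc, n - pos)
      else go (pos + 4 + len) (String.sub log (pos + 4) len :: acc)
  in
  go 0 []

(* Helper used only to build example logs in the assumed format. *)
let entry (payload : string) : string =
  let len = String.length payload in
  let b shift = String.make 1 (Char.chr ((len lsr shift) land 0xff)) in
  b 24 ^ b 16 ^ b 8 ^ b 0 ^ payload
```

A length prefix detects truncation but not a torn write that leaves garbage of the right length; adding a per-entry checksum would be a natural extension.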

Bug V4: Crash during snapshot update causes loss of data. 

This bug caused the server to lose data due to defective 
code that wrote the disk snapshots. Verdi’s shim layer implemented
a snapshotting mechanism that, every 1000 events,
executed the following tasks: (1) wrote a new snapshot, (2) 
removed any previous snapshot, and (3) truncated the log. 
Unfortunately, data loss could occur because of the unsafe 
order in which the three tasks were executed. In particular, 
the implementation truncated the existing disk snapshot be- 
fore it safely wrote the new one to disk. Thus, a crash be- 
tween the truncation and write operations led to loss of data. 

Because this bug caused data loss, it is more serious 
than V3, as the administrator cannot easily recover from the 
problem. Fixing this bug would require consistently ensuring 
that durable information remains on the disk; in particular, 
the old snapshot should be deleted after the new snapshot is 
written to disk. 
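
A common way to obtain this ordering is to write the new snapshot to a temporary file and then rename it over the old one. The sketch below is not the fix applied to Verdi; the `.tmp` suffix and the function name are assumptions, and it relies on rename replacing the target atomically, as it does on POSIX systems.

```ocaml
(* Replace the snapshot at [path] with [data] so that a crash leaves
   either the old snapshot or the new one in place, never neither. *)
let write_snapshot ~(path : string) (data : string) : unit =
  let tmp = path ^ ".tmp" in
  let oc = open_out_bin tmp in
  (try
     output_string oc data;
     (* close_out flushes OCaml's channel buffer to the OS. It does not
        force the data to stable storage. *)
     close_out oc
   with e ->
     close_out_noerr oc;
     raise e);
  (* The old snapshot is replaced only after the new one is written. *)
  Sys.rename tmp path
```

The standard library has no fsync, so this sketch does not protect against power loss; a durable version would call `Unix.fsync` on the temporary file before the rename. The log should be truncated only after the rename has completed.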
Bug V5: System call error causes wrong results and data
loss.

This bug affected servers that were recovering and was 
ultimately caused by the server not correctly distinguishing 
between situations where there was both a log and snapshot 
and those where there was only a log. The latter occurred if 
the server crashed before it executed 1000 events (i.e., before
the first snapshot was created).

During recovery, the server tried to read the snapshot file 
and if it failed to open it, the server wrongly presumed that 
the snapshot file did not exist. In practice, this meant that 
a transient error returned by the open system call, such as 
insufficient kernel memory or too many open files, caused 
the server to silently ignore the snapshot. 

Our testing framework generated a test case that caused 
the servers to silently return results as if no operations had 
been executed before the server crashed, even though they 
had. This bug could also lead to other forms of safety viola- 
tions given that servers discard a prefix of events (the snap- 
shot) but read the suffix (the log), potentially passing valida- 
tion checks. Further, the old snapshot could be overwritten 
after a sufficient number of operations were executed. 
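
The distinction that the recovery code needs to make can be sketched as a pure decision function. The types and names below are hypothetical; the result of the open call is abstracted into three cases so the logic can be shown without system calls. With OCaml's Unix library, the `Missing` case would correspond to `Unix.Unix_error (Unix.ENOENT, _, _)`, and any other error to `Failed`.

```ocaml
(* Possible outcomes of trying to open and read the snapshot file. *)
type open_result =
  | Opened of string (* snapshot contents *)
  | Missing (* the file does not exist *)
  | Failed of string (* any other error, e.g. too many open files *)

(* What recovery should do next. *)
type recovery =
  | Replay of { snapshot : string option; log : string list }
  | Abort of string

(* Only a snapshot that is known to be absent may be skipped. A transient
   error must stop recovery instead of being treated as "no snapshot". *)
let recover (snap : open_result) (log : string list) : recovery =
  match snap with
  | Opened s -> Replay { snapshot = Some s; log }
  | Missing -> Replay { snapshot = None; log }
  | Failed msg -> Abort ("cannot read snapshot: " ^ msg)
```

Aborting, or retrying later, keeps a transient failure from silently turning into lost state.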

4.3 Resource Limits 

This section describes three bugs that involve exceeding 
resource limits. 

Bug V6: Large packets cause server crashes.

The server code that handled incoming packets had a bug 
that could cause the server to crash under certain conditions. 
The bug, due to an insufficiently large buffer in the OCaml
code, caused large incoming packets to be truncated and
subsequently prevented the server from correctly unmarshaling
the message.

More specifically, this bug could be triggered when a 
follower replica substantially lagged behind the leader. This 
could occur if the follower crashed and stayed offline while 
the rest of the servers processed approximately 200 client 
requests. Then, during recovery, the follower would request 
the list of missing operations, which would all be combined 
into a single large UDP packet that exceeded the buffer size 
and crashed the server. 

The fix to this problem was to simply increase the size 
of the buffer to the maximum size of the contents of a 
UDP packet. However, Bug V7 and Bug V8, which we
describe next, were also related to large updates caused by
lagging replicas, but these are harder to fix.

Bug V7: Failing to send a packet causes server to stop
responding to clients.

Another bug we found prevented servers from responding
to clients when the leader tried to send large packets to
a lagging follower. The problem was caused by wrongly
assuming that there was no limit on the packet size and
by incorrectly handling the error produced by the sendto
system call. This bug was triggered when a replica that
lagged behind the leader by approximately 2500 requests
tried to recover.

let rec findGtIndex orig_base_params raft_params0
    entries i =
  match entries with
  | [] -> []
  | e :: es ->
    if (<) i e.eIndex
    then e :: (findGtIndex orig_base_params
                 raft_params0 es i)
    else []

Figure 8: OCaml code, generated from verified Coq code, that
crashed with a stack overflow error (Bug V8). In practice, the stack
overflow was triggered by a lagging replica.

In contrast to Bug V6, this bug was due to incorrect code
on the sender side. In practice, the consequence was that 
a recovering replica could prevent a correct replica from 
working properly. The current fix applied by the developers 
mitigates this bug by improving error handling, but it still 
does not allow servers to send large state. 
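
One way to send large state despite the packet size limit is to split it across several datagrams. The Python sketch below illustrates the idea under stated assumptions: the newline-delimited JSON encoding and the name chunk_entries are ours, not Verdi's, and the limit of 65507 bytes is the maximum UDP payload over IPv4 (65535 minus a 20-byte IP header and an 8-byte UDP header).

```python
import json

MAX_UDP_PAYLOAD = 65507  # 65535 - 20 (IPv4 header) - 8 (UDP header)

def chunk_entries(entries, max_payload=MAX_UDP_PAYLOAD):
    """Pack log entries into messages that each fit in a single datagram."""
    messages, batch, size = [], [], 0
    for entry in entries:
        # One JSON document per line; json.dumps never emits a raw newline.
        blob = json.dumps(entry).encode() + b"\n"
        if len(blob) > max_payload:
            raise ValueError("single entry exceeds the datagram payload limit")
        if size + len(blob) > max_payload:
            messages.append(b"".join(batch))  # flush the full batch
            batch, size = [], 0
        batch.append(blob)
        size += len(blob)
    if batch:
        messages.append(b"".join(batch))
    return messages
```

A sender would pass each returned message to a separate send call; the receiver would still need to tolerate loss and reordering across messages, which this sketch does not address.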

Bug V6 and Bug V7 were the only two that we did not have 
to report to developers because the developers independently 
addressed the bugs during our study. 

Bug V8: Lagging follower causes stack overflow on leader.

After applying a fix for Bug V6 and Bug V7, we found that
Verdi suffered from another bug that affected the sender side 
when a follower tried to recover. This bug caused the server 
to crash with a stack overflow error and was triggered when 
a recovering follower lagged by more than 500,000 requests. 

After investigating, we determined that the problem was
caused by the recursive OCaml function findGtIndex(),
which is generated from verified code. This function, which
constructed the list of log entries that the follower was missing,
was executed before the server tried to send network data.
This was an instance of a bug caused by exhaustion of
resources (stack memory).

Figure 8 shows the generated code responsible for crash- 
ing the server with the stack overflow. This bug appeared 
difficult to fix as it would require reasoning about resource 
consumption at the verified transformation level (§2.3). It 
also could have serious consequences in a deployed setting 
because the recovering replica could iteratively cause all 
servers to crash, bringing down the entire replicated system. 
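
The failure mode can be reproduced in miniature with the following Python analogue of findGtIndex (an illustrative sketch, not the generated OCaml code; the dictionary-based entry representation is an assumption). The recursive variant consumes one stack frame per returned entry and fails on long logs, whereas the iterative variant computes the same prefix in constant stack space.

```python
from itertools import takewhile

def find_gt_index_recursive(entries, i, pos=0):
    """Mirror of the recursive definition: one stack frame per entry kept."""
    if pos == len(entries):
        return []
    e = entries[pos]
    if i < e["index"]:
        return [e] + find_gt_index_recursive(entries, i, pos + 1)
    return []

def find_gt_index(entries, i):
    """Same result, computed iteratively: the leading entries with index > i."""
    return list(takewhile(lambda e: i < e["index"], entries))
```

In Python the recursive variant raises RecursionError once the interpreter's recursion limit is exceeded; the native-code analogue of that condition is the stack overflow that crashed the leader.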

4.4 Summary of Findings 

Finding 1: The majority (9 of 11) of shim layer bugs caused
servers to crash or hang.

Bugs that cause servers to crash or stop responding are
particularly serious, especially for replicated distributed
systems that have the precise goal of increasing service
availability by providing fault-tolerance. Therefore, proving
liveness properties is particularly important in this class of
systems to ensure the satisfaction of user requirements.

Finding 2: Incorrect code involving communication caused 
5 of 11 shim layer bugs. 

Surprisingly, we concluded that extending verification
efforts to provide strong formal guarantees on communication
logic would prevent nearly half (5 of 11) of the bugs found in
the shim layer, thereby significantly increasing the reliability
of these systems. In particular, this result calls for composable,
verified RPC libraries.

Finding 3: File system operations were responsible for 3 of 
11 shim layer bugs. 

File system semantics are notoriously difficult for devel- 
opers to understand, especially in a crash-recovery model. 
The bugs we found were all located in the unverified recov- 
ery component of Verdi, the only system we studied that im- 
plements durability. This result confirms the importance of 
recent efforts to formalize file system semantics and verify 
file systems [3, 12, 51]. 

Furthermore, we found the official OCaml library reference
documentation to be surprisingly terse and devoid of
content. For instance, many functions provided by the basic
operations module (Pervasives module) were documented
with three or fewer sentences. This problem also affects file
system functions, which have particularly complex semantics
due to possible error conditions. Libraries, especially those
relied upon by verified systems, should have complete and
accurate documentation.

Finding 4: Three of 11 shim layer bugs were related to 
resource limits. 

Bug V6 and Bug V7 were caused by unreasonable assump- 
tions about: (1) the buffer size limits, and (2) the maximum 
size of UDP packets. This suggests that explicitly reason- 
ing about different types of resource limits is important for 
the reliability of systems. For example, IronFleet verified 
that message sizes always fit within a UDP packet, albeit, 
in this case, the state machine size was bounded to fit the 
UDP packet size. Ideally, verification should reason about 
resources and ensure reasonable bounds on their usage. 

Similarly, Bug V8, caused by assumptions on stack mem- 
ory size, confirms that reasoning about resource limits is vi- 
tal to prevent potentially serious bugs. In this case, the verifi- 
cation checks did not reason about the consumption of stack 
memory, which is an implicit resource. 

Finding 5: No protocol bugs were found in the verified 
systems. 

None of the bugs we found were due to mistakes in the 
implementation of distributed protocols (e.g., Paxos, Raft), 
which are well known to be complex and difficult to imple- 
ment correctly. Our results suggest that verification does im- 
prove the reliability of the verified components: all the bugs 
described in this section were located in the unverified shim
layer code, in the unverified shim layer library (Bug C3), or
in the shim layer runtime (Bug V8). In fact, we found no shim
layer bugs in IronFleet, which is the system studied with
the fewest unverified components.


4.5 PK Toolchain: Preventing Shim-layer Bugs 

Our results demonstrate, with concrete examples, that over- 
all system correctness crucially depends on making cor- 
rect assumptions regarding the shim layer, which represents 
a small subset of the entire TCB. Motivated by these re- 
sults, this subsection argues for the adoption of testing ap- 
proaches that complement verification techniques by testing 
non-verified components that interface with verified ones. 
More specifically, as part of the verification methodology, 
the shim layer should be independently tested to detect bugs 
that arise from possible mismatches between assumptions 
made by verified code and the properties provided by the 
shim layer implementation. 

We built several test cases that specifically targeted
Verdi’s shim layer and incorporated them into our PK testing
toolchain. Our test cases consist of three testing applications
that we implemented in OCaml and linked directly with
Verdi’s shim layer (i.e., excluding the verified code). Each
of these applications checks a different property that was
assumed by the verified code: (1) the integrity of messages sent
between servers, (2) the integrity of messages sent between
clients and servers, and (3) the integrity of the abstract state
machine log during recovery. Even though neither Verdi nor
Chapar aimed to prove liveness properties, our toolchain uses
timeout mechanisms to test for liveness, which is necessary
to detect serious classes of bugs, such as those that
crash or hang servers.

In addition to test cases, we implemented a file system 
and network fuzzer that transparently, using LD_PRELOAD, 
modifies the environment, unmasking bugs that would oth- 
erwise remain undetected. For instance, our fuzzer emulates 
different behaviors permitted by OS semantics, such as re- 
ordering UDP packets, duplicating UDP packets, executing 
non-atomic disk writes, and producing spurious system call 
errors. Our experiments demonstrated that our testing infras- 
tructure detects all shim layer bugs found, except for Bug V8 
which is caused by implicit resource usage. 
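
The kind of network behavior that the fuzzer emulates can be sketched in a few lines of Python. This is an in-process illustration only: the actual fuzzer interposes on system calls through LD_PRELOAD, and the names fuzz_datagrams and reassemble, as well as the (sequence number, payload) packet format, are hypothetical. The first function duplicates and reorders a stream of datagrams; the second is a receiver that preserves message integrity under exactly those behaviors.

```python
import random

def fuzz_datagrams(datagrams, seed=0, dup_prob=0.2):
    """Return the datagrams with random duplication and reordering applied."""
    rng = random.Random(seed)  # seeded so that failing runs can be replayed
    out = []
    for d in datagrams:
        out.append(d)
        if rng.random() < dup_prob:
            out.append(d)  # UDP may deliver the same datagram twice
    rng.shuffle(out)       # UDP does not guarantee delivery order
    return out

def reassemble(datagrams):
    """Receiver that tolerates duplicates and reordering via sequence numbers."""
    seen = {}
    for seq, payload in datagrams:
        seen.setdefault(seq, payload)  # ignore repeated sequence numbers
    return [seen[seq] for seq in sorted(seen)]
```

An integrity test then asserts that reassemble(fuzz_datagrams(sent, seed)) equals the payloads that were sent, for many seeds; a shim layer that assumes in-order, exactly-once delivery fails such a test.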

Using formal verification techniques, reasoning about im- 
plicit memory usage and verifying that it is guaranteed to be 
within given bounds would prevent bugs like the stack over- 
flow bug [8]. However, it is unclear how to apply these tech- 
niques to verify the resource usage of distributed systems. A 
middle-ground approach would be to design test cases using 
tools such as our fuzzer and to monitor the resource usage of 
verified components, checking whether it matches expected 
resource consumption models.

5. Specification Bugs

This section discusses two bugs that we found in the specifi- 
cation of the systems analyzed. Neither bug caused the cur- 
rent implementation of servers to crash or otherwise produce 
incorrect results (unlike the bugs discussed in §4); however, 
specification bugs partially void verification guarantees. In 
practice, both bugs would allow distributed system imple- 
mentations that return incorrect results to pass verification 
checks. 

Bug I1: Incomplete high-level specification prevents
verification of exactly-once semantics.

We found that the high-level specification of IronFleet’s 
RSM server did not ensure linearizability because it did not 
specify that the implementation had exactly-once semantics 
even though it did implement this functionality. 

We demonstrated the problem by constructing a patch 
that modified the server. The patch disabled the deduplica- 
tion functionality, which had been implemented, by modify- 
ing only seven lines of the implementation. Notably, the 
patched implementation still verified. Our patch demon- 
strates that an implementation bug could prevent the servers 
from providing exactly-once semantics and this problem 
would not be detected by the verification process. 

RSM libraries usually implement exactly-once semantics 
by using per-client sequence numbers to identify the request. 
This mechanism lets servers distinguish duplicate requests, 
which can occur for several reasons. Duplicate requests can 
arrive at servers due to network semantics (i.e., the network 
can duplicate packets, and clients must retransmit packets if 
they suspect lost packets) and the fault model (clients need 
to resend requests if they suspect that a server might have 
crashed). 
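
The mechanism can be made concrete with a minimal Python sketch of a counter service, under the assumptions that each client has at most one outstanding request and that sequence numbers increase per client (the class and method names are ours, not IronFleet's). The server remembers, per client, the last sequence number executed and the reply it produced; a retransmitted request is answered from this table instead of being executed again.

```python
class CounterServer:
    """Counter state machine with per-client de-duplication."""

    def __init__(self):
        self.counter = 0
        self.last = {}  # client_id -> (last sequence number, cached reply)

    def handle(self, client_id, seq, op):
        seen = self.last.get(client_id)
        if seen is not None and seq <= seen[0]:
            # Duplicate or stale request: replay the cached reply
            # without executing the operation a second time.
            return seen[1]
        if op == "increment":
            self.counter += 1
        reply = self.counter
        self.last[client_id] = (seq, reply)
        return reply
```

Removing the de-duplication check leaves a server that still answers every request but increments the counter once per delivered copy, which is the behavior that the specification failed to rule out.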

We reported this bug to developers, who confirmed that 
the specification did not provide exactly-once semantics. 
However, they stated that: their understanding of lineariz- 
ability does not include exactly-once semantics; the state 
machine could implement de-duplication; and they had been 
aware of the absence of exactly-once semantics in the speci- 
fication. In response to our bug report, the developers added 
a comment to the source code, clarifying that the specifica- 
tion does not cover exactly-once semantics. 

We consider this a bug because generally applications 
expect replicated state machine libraries to provide exactly- 
once semantics and because the absence of this guarantee 
would cause incorrect results for applications that expect 
it [33]. Regardless of the definition of linearizability, this 
example demonstrates well the need to be clear about the 
exact specification. 

Bug C4: Incorrect assertion prevents verification of causal
consistency in client.

The client application prog_photo_upload had a bug in
an assertion (see Figure 9). The check simply asserted the
condition of the conditional branch (the if-condition). Therefore,
the assertion always evaluated to true, which defeated
its purpose. This bug was easily fixed by asserting that photo
(instead of post) is non-null.

  (* Client 0 *)
  put (Photo, "NewPhoto");;
  put (Post, "Uploaded");;

  (* Client 1 *)
  post <- get (Post);;
  if (string_dec post "Uploaded") then
    photo <- get (Photo);;
-   if string_dec post "" then // Original
+   if string_dec photo "" then // Fix
      fault
    else
      skip

Figure 9: Patch to fix Bug C4 in a client example of Chapar
(Clients.v). The original Coq code is equivalent to
assert(post != ""). Instead, the assertion should check that
photo is non-null when the post is non-null.
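
A Python transcription of Client 1 makes the tautology visible (a sketch only; the dictionary standing in for the key-value store and the function names are assumptions). Inside the branch, post is known to equal "Uploaded", so the original check can never fire, even when the store exposes the post without the photo it causally depends on.

```python
def client1_original(store):
    post = store.get("Post", "")
    if post == "Uploaded":
        photo = store.get("Photo", "")
        if post == "":  # always false here: post == "Uploaded"
            return "fault"
    return "skip"

def client1_fixed(store):
    post = store.get("Post", "")
    if post == "Uploaded":
        photo = store.get("Photo", "")
        if photo == "":  # post visible but photo missing: causal violation
            return "fault"
    return "skip"
```

With a store that contains only {"Post": "Uploaded"}, the original client returns "skip" while the fixed client returns "fault".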

Besides verifying the server, Chapar verified clients using 
model checkers to detect causal consistency violations. In 
this example, no other assertion correctly performed the 
causal consistency check in the client application. Of the 
three systems analyzed, only Chapar verified clients, and this 
was the only bug that we found in a client. All other bugs 
were discovered in servers or in verification tools. 

5.1 Summary of Findings 

Finding 6: Incomplete or incorrect specification can prevent 
correct verification. 

Even if verification tools are correct, specifications must 
be correct for verification to deliver on its promise. Bug I1
showed an example of a subtle specification problem that 
could result in application bugs. Regardless of the exact 
definition of linearizability, the existence of different inter- 
pretations is already sufficient to lead to application bugs. In 
addition, we note that it was not initially obvious to us that 
the IronFleet specification did not guarantee exactly-once 
semantics. To complement the current verification method- 
ologies, we need techniques to test specifications. 

5.2 PK Toolchain: Preventing Specification Bugs 

As we have discussed, verification crucially relies on the 
correctness of the specification. This section discusses two 
types of techniques that seek to validate specifications them- 
selves, and we report our experience on applying them to 
IronFleet. The first technique, negative testing, tests by ac- 
tively introducing bugs into the implementation and confirm- 
ing that the specification can detect them during verification. 
The second technique, specification checking, relies on prov- 
ing properties about the specification. 

Negative testing. We built a testing tool that automatically
modifies the implementation source code. Its goal is to
help developers check whether the source code implements
more functionality than required by the specification, a problem
that we found in IronFleet (Bug I1). This could indicate an
incomplete specification.

Our tool performs three types of simple transformations
to Dafny source code that disable code: (1) changing the
values of sub-conditions, (2) preventing updates to structures,
and (3) commenting out entire statements. These
transformations sufficed to generate the changes necessary
to modify the implementation in a manner functionally
equivalent to the patch we discussed for Bug I1. Nevertheless,
like the types of transformations used in mutation
testing [16], other types of transformations could be considered,
such as removing parts of statements or modifying
statements in different ways.

Our tool requires developers to specify the functions to be
tested and an upper bound on the number of transformations
per function. Our evaluation shows that when applied to
HandleRequestBatchImpl, LProposerProcessRequest,
and ProposerProcessRequest in IronFleet, the tool requires
on average 377, 18, and 8015 iterations, respectively, to
generate the patches; these modified functions disabled
the de-duplication function but still passed the verification
checks. Currently, our tool simply picks random transformations
and applies them at random source code locations. If
not guided by the developer, this process could be expensive
given that several matching modifications must be made to
pass the verification checks. For instance, IronFleet relies on
a consistent protocol and implementation layer; therefore,
to pass the verification checks, the transformed source code
needs matching modifications at both layers.
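
The third transformation (commenting out entire statements) and the surrounding search loop can be sketched as follows. This Python sketch is illustrative rather than a description of our tool: the function names are hypothetical, statements are approximated as lines ending in a semicolon, and the verifies callback stands in for an invocation of the real verifier.

```python
import random

def comment_out_statements(source, count, seed=0):
    """Comment out `count` randomly chosen statement lines."""
    rng = random.Random(seed)
    lines = source.splitlines()
    candidates = [n for n, line in enumerate(lines)
                  if line.strip().endswith(";")
                  and not line.strip().startswith("//")]
    for n in rng.sample(candidates, min(count, len(candidates))):
        lines[n] = "// " + lines[n]
    return "\n".join(lines)

def surviving_mutants(source, verifies, attempts, count=1):
    """Return the distinct mutants that still pass verification."""
    survivors = []
    for seed in range(attempts):
        mutant = comment_out_statements(source, count, seed)
        if mutant != source and verifies(mutant) and mutant not in survivors:
            survivors.append(mutant)
    return survivors
```

Every surviving mutant marks code whose removal the specification does not notice, and is therefore a candidate for an incomplete specification.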

Specification checking. A non-programmatic approach
to finding specification bugs relies on proving properties of
the specification. We explored this approach by writing a test
case in Dafny that combined the specification of IronFleet
with the specification of the counter application it provides.
By composing the two, we constructed a machine-checked
lemma that confirmed the possibility of reaching any counter
state after executing a single counter operation. This formally
confirmed that the specification did not prevent duplicate
execution of operations (Bug I1).

Ideally, such tests should be built by developers who did
not write the specification, which adds a level of redundancy,
and should be reused across projects. Alternatively, verifying
the implementation of applications and formally composing
it with the verified distributed system library layer could
also increase confidence in the correctness of the distributed
system specification. However, it would still leave open the
correctness of the top-most specification.

6. Verification Tool Bugs 

This section analyzes four bugs we found in verification
tools. Like specification bugs (§5), none of these bugs
crashed the server implementation or otherwise produced
incorrect results. However, they invalidated verification
guarantees. In general, verification tool bugs can cause incorrect
server implementations to pass the verification check even if
the specification is correct. All the problems reported were
found in either auxiliary tools or at the perimeter of the
verifier’s core functionality.

Bug I2: Prover crash causes incorrect validation.

A bug in IronFleet’s tools caused the verifier to falsely 
report that any program passed verification checks, includ- 
ing programs that asserted false. In addition, the verifier built 
binaries for incorrect programs. 

This bug was caused by a defect in NuBuild, a component
of the verification infrastructure similar to Unix make. NuBuild
invoked Dafny once for each source code file to verify the
file and, if verification succeeded, compile it. For each
invocation, NuBuild parsed the output produced by Dafny and
aborted the build process if it detected an error; otherwise,
it continued verifying and eventually built the binary.

Unfortunately, NuBuild incorrectly parsed the output of
Dafny (see Figure 10). Dafny invoked both Boogie (the verifier)
and Z3 (the prover) and emitted a diagnostic message
regarding the verification process and an additional message
if the prover crashed. NuBuild’s parsing function mistakenly
terminated after consuming just the first message even
if there was an additional message regarding a prover crash.
We found this bug because it was triggered by another bug
that caused the prover, Z3, to crash (Bug I4). However, this
bug could also be triggered in other situations that caused Z3
to abruptly terminate, such as insufficient memory or other
system errors.

In this case, several aspects combined to increase the 
potential for an unsuspecting developer to be tricked by the 
incorrect verifier: (1) no error or warning was made visible 
to the user, (2) the verifier built the program binary (and 
updated it if the source code changed), and (3) when the bug 
was triggered, the build duration did not change drastically. 

This bug is not necessarily triggered on every verifier in- 
vocation. Instead, it can be triggered by “transient” prob- 
lems, such as the termination of the prover by the OS due to 
insufficient resources (e.g., memory). Furthermore, because 
verification is computationally expensive and slow, NuBuild 
can offload verification to a cluster of remote machines. Un- 
der this setting, Dafny will run on many different machines 
and potentially different NuBuild installations. This aggra- 
vates the impact of the NuBuild bug because the verification 
process could sporadically and silently fail. 

This bug was confirmed by developers and fixed with our 
patch proposal. Our patch changed the order of the regular 
expressions used in the parsing function parseOutput(). 
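
The effect of reordering the checks can be sketched in Python as follows. This is an illustration of the idea, not the NuBuild patch itself: the two message patterns follow Figure 10, while the function name, the return convention, and the final fall-through rule (treating unrecognized output as a failure) are our assumptions.

```python
import re

PROVER_DIED = re.compile(r"Prover error: Prover died")
FINISHED = re.compile(
    r"Dafny program verifier finished with (\d+) verified, (\d+) errors*")

def parse_output(output):
    """Return (verification_failures, parse_failures) for one Dafny run."""
    # Check for a prover crash first: the summary line may still report
    # zero errors in that case, so it must not be trusted on its own.
    if PROVER_DIED.search(output):
        return 0, 1
    match = FINISHED.search(output)
    if match:
        return int(match.group(2)), 0
    # Unrecognized output is counted as a failure rather than a success.
    return 0, 1
```

Under this ordering, output that contains both the summary line and the prover crash message is reported as a failure.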

Bug I3: Signals cause validation of incorrect programs.

When executing the verifier on Linux and macOS, we
found that if the user sent a SIGINT signal, verification was
interrupted (as expected), but Dafny misleadingly reported
that no error occurred, and that the files were verified. This
bug is similar to Bug I2 except that the verifier did not build
the binary.

noTimeouts = new Regex(
    "Dafny program verifier finished with (\\d*) verified, (\\d*) errors*");
proverDied = new Regex("Prover error: Prover died");

void parseOutput(out verificationFailures,
                 out parseFailures) {
  match = noTimeouts.Match(output);
  if (match.Success) {
    verificationFailures = match.Groups[2];
    return; // <=== Returns when the prover dies
  }
  match = proverDied.Match(output);
  if (match.Success) {
    parseFailures = 1; // <=== Not executed
    return;
  }
}

Figure 10: Simplified version of the NuBuild code responsible for
Bug I2. The proverDied regular expression was never matched
because the other regular expression matched first and returned
from parseOutput(). Furthermore, Dafny only included errors
found by the prover in the error count matched by the first regular
expression, not errors executing the prover.

  void WaitForOutput() {
    try {
      outcome = thmProver.CheckOutcome(cce.NonNull(handler));
    }
    catch (UnexpectedProverOutputException e) {
      outputExn = e;
    }
+   catch (Exception e) {
+     outputExn = new UnexpectedProverOutputException(e.Message);
+   }
  }

Figure 11: Patch to fix Bug I3 in Boogie source code (Check.cs).
Adding a general exception handler caught all exceptions thrown
while the prover was executing.

This bug was caused by Boogie (a low-level verifier invoked
by Dafny) not correctly handling the exception thrown
when the prover is interrupted. The exception handling code
handled only UnexpectedProverOutputException exceptions,
but SIGINT threw a different type of exception. Boogie’s
developers have fixed the problem by patching the verifier
(Figure 11).

Bug I4: Incompatible libraries cause prover crash.

The prover included in the IronFleet distribution failed 
to execute, with an error (0xc00000007b). The problem was 
caused by the inclusion of incompatible libraries in the package;
the included z3.exe binary was built for 64-bit architectures,
while the binary libraries included were built for
32-bit architectures.


The problem can be more serious than it appears because 
the prover was not invoked directly by users; rather, it was 
invoked by other verifier components, some of which had 
defective error detection mechanisms. In fact, this problem 
triggered Bug I2. The solution to this bug is simple: after we
reported it, the developers fixed it by updating the libraries 
with matching architectures. 

6.1 Summary of Findings 

Finding 7: There were critical bugs in current verification 
tools that could compromise the verification process. 

Verification tools are complex and increasingly auto- 
mated. In addition, they evolve quickly given their growing 
popularity. Thus, it is not surprising that they contain bugs. 
However, it is surprising that we found a combination of 
bugs (Bug I2 and Bug I4) that could mislead unsuspecting
developers, potentially with serious impact on the correct- 
ness of verified programs. The correctness of verification 
tools becomes even more relevant if the programmer is an 
adversary [45]. 

Finding 8: All critical verifier bugs were caused by functions 
that were not part of the core components of the verifier. 

Surprisingly, the critical bugs found in verification tools
(Bug I2 and Bug I3) were not caused by the verifier’s core
components (i.e., the parts that reason about proofs). Instead,
they were found in auxiliary tools (Bug I2) and in the verifier’s
exception handling (Bug I3). These results call for better
methodologies to design and compose the various components
of verification infrastructures to ensure either correct
or, at least, fail-safe operation (e.g., reporting a verification 
error rather than success if there is any exception). 
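
A fail-safe wrapper of this kind can be sketched in a few lines of Python (an illustration only; the function name and the string results are hypothetical). Success is reported solely when the check completes and explicitly returns True; every other outcome, including an interruption, maps to an error.

```python
def run_fail_safe(check):
    """Report "verified" only if `check` returns True without raising."""
    try:
        ok = check()
    except BaseException:
        # BaseException also covers KeyboardInterrupt, Python's analogue of
        # an interruption by SIGINT. A real tool would additionally log the
        # cause and exit with a non-zero status.
        return "error"
    return "verified" if ok is True else "error"
```

The design choice is that the absence of a positive result is never interpreted as success, which is the property that both Bug I2 and Bug I3 violated.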

6.2 PK Toolchain: Preventing Verification Tool Bugs 

As the bugs we found attest, verification infrastructures can 
contain serious bugs that potentially compromise the verifi- 
cation process. The problem of verifier correctness has been 
studied in the context of traditional verifiers, and several 
techniques have been proposed, including verified code ex- 
traction [44]. 

We developed verifier test cases, consisting of sanity 
checks, that deliberately caused the verification process 
to fail under different scenarios. This process determined 
whether the verification infrastructure could detect certain 
classes of verification problems. Although simple, these tests 
enabled our testing toolchain to detect the bugs that affected 
NuBuild and Z3. We argue that sanity checks should be conducted
on the verification infrastructure and under its actual
execution environment, at the very least when generating
the system binaries that will be deployed.
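
The simplest such sanity check submits a program that must not verify and confirms that the infrastructure rejects it. The Python sketch below assumes a hypothetical verify callback that runs the complete toolchain on a source string and returns True on success; the Dafny snippet is a minimal program containing a false assertion.

```python
KNOWN_BAD = "method Main() { assert false; }"

def infrastructure_is_sane(verify, known_bad=KNOWN_BAD):
    """True only if the toolchain explicitly rejects a known-bad program."""
    try:
        return verify(known_bad) is False
    except Exception:
        # A crash is not an explicit rejection, so it is treated as suspect.
        return False
```

Running this check in the same environment that produces the deployed binaries would have exposed the combination of Bug I2 and Bug I4, because the known-bad program was reported as verified.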

Interestingly, there has been significant recent interest
in increasing the level of automation of modern verifiers.
As a consequence, recent verifiers have become extremely
complex. Whereas traditional verifiers (e.g., SAT solvers)
relied on relatively simple artifacts, the Dafny verifier, for
example, relies on Boogie, which in turn relies on the Z3
SMT solver, which itself is more complex than traditional
verifiers. Furthermore, to mitigate the impact on verification
time caused by the high degree of automation, verifiers are
becoming distributed systems that rely heavily on caching
across multiple machines. This trend suggests that some of
the techniques and methodologies that have been developed
to improve the robustness of other systems should now be
considered for modern verifiers.

Figure 12: Summary of the unverified distributed systems.

7. Response from Developers 

The developers confirmed the existence of all problems we
reported. However, as discussed earlier, they did not consider
Bug I1 to be a bug because of their different understanding of
linearizability. The developers agreed to apply the patch we
proposed for Bug C4. Regarding Bugs C1-3, developers stated
that causal consistency is guaranteed if the explicit communication
properties in the semantics hold and suggested different
fixes to improve the implementation (not using UDP,
modeling reordering, using acknowledgments, and limiting
input size). Bugs V1-5 and Bug V8 were confirmed by the
developers. As shown in Figure 3, the rest of the bugs have
already been fixed.

8. Toward “Bug-Free” Distributed Systems 

To gain insight into how we can move towards “bug-free”
distributed systems, we tried to understand the components
and sources of reliability problems in modern deployed
distributed systems. Most of these systems implement
a large set of features that have yet to be verified in any
of the distributed systems analyzed, even though some of
these features are particularly complex (thus potentially bug
prone) and important for real-world users.

8.1 Methodology 

Our analysis relied on the inspection of reports of known
bugs by sampling bugs from the issue trackers of each
unverified system (Figure 12). Due to the large volume of bug
reports, we restricted our analysis to confirmed reports opened
between March 2015 and March 2016 (a 1-year span). In
addition, we discarded low-severity bugs and bugs that did not
affect functional correctness. Figure 13 presents an overview
of the results, which support the findings in §8.2.

Limitations. We do not intend to compare the bug count
between the verified and unverified systems due to their
significant differences. These unverified systems are not research
prototypes; they implement numerous and complex
features, have been tested by innumerable users, and were
built by large teams. Furthermore, the analysis methodologies
differ because the scale of the unverified systems would
make it impractical for us to manually find undiscovered
bugs, as we did for verified systems. Instead of aiming at
a direct comparison, our analysis of unverified systems was 
motivated by the need to understand how future verification 
efforts can improve the reliability and robustness of real- 
world distributed systems. 

8.2 Results 

Finding 9: No protocol bugs were found in verified systems, 
but 12 bugs were reported in the corresponding components 
of unverified systems. 

This result suggests that recent verification methods de- 
veloped for distributed systems improve the reliability of 
components that, despite decades of study, are still not im- 
plemented correctly in real-world deployed systems. In fact, 
all unverified systems analyzed had protocol bugs in the one- 
year span of bug reports we analyzed. 

Finding 10: Most of the bugs in unverified systems were 
found in management (160) and storage layers (230). 

In part due to optimizations, the complexity of manage- 
ment tools and storage layers explains the unreliability of 
these components. This result strongly suggests that the ap- 
plication of verification techniques to these tools and layers 
could significantly improve the reliability of distributed systems.
Interestingly, much recent interest has focused on the
verification of file systems [12, 50]. Our observations support
the interest in this research direction.

Finding 11: A total of 24 bugs in these systems were caused
by multi-threaded concurrency.

Only 4.3% of the bugs (24 out of 555) reported in deployed
systems were due to local concurrency. A closer analysis
of these bug reports showed that almost all concurrency
bugs were in the storage layer. This relatively low number
could be caused by the absence of concurrency or the use of
coarse-granularity concurrency mechanisms in most components.
Another contributing factor could be that concurrency
bugs are often under-reported [42].

Although real-world distributed systems often rely on 
concurrency, none of the verified systems that we analyzed 
implemented multi-threaded concurrency. In fact, the verifi- 
cation of shared-memory concurrent software, in general, is 
an active area of research that is considered extremely chal- 
lenging [22, 26]. Thus, it remains unclear how researchers 
will address this challenge in the context of complex dis- 
tributed systems and, accordingly, which testing techniques 
should be adopted. 

8.3 Discussion 

Correctly writing complex software is hard for developers. 
In traditional unverified systems, a single mistake made by 
developers when writing code can lead to serious bugs that 
immediately compromise the correctness of the application, 
causing crashes or incorrect results. Verifying implementations
significantly improves on this situation by adding a level of
redundancy.

Verified components can only fail, regarding verified 
properties, if developers introduce both an implementation 
bug and a verification bug (i.e., specification or verifier bug). 
Furthermore, those two bugs have to match: the verifica- 
tion bug has to cause the verification process to miss the 
implementation bug. This extra level of redundancy helps 
explain why we did not find any protocol-level bugs in any 
of the verified prototypes analyzed, even though such bugs
are common even in mature unverified distributed systems. 
Next we discuss different paths to improve the reliability of 
verified components that are protected by this redundancy 
and unverified components that remain vulnerable: 

Verifiers. We believe the routine application to verifiers 
of general testing techniques (e.g., sanity checks, test-suites, 
and static analyzers) and the adoption of fail-safe designs 
should become established practices. Due to the reliance on 
increasingly complex SMT solvers and caching mechanisms 
and because verifiers are becoming distributed systems, test- 
ing and correctly implementing verifiers is expected to be- 
come increasingly challenging. This increased complexity 
calls for the development of scalable testing techniques and 
improved verifier designs. 

Specification. In addition to verifier bugs, specification 
bugs could invalidate verification guarantees. Proving prop- 
erties about the specification or reusing specifications are 
two important ways to increase the confidence that they are 
correct. The latter is likely to occur naturally with the expan- 
sion of verification to other systems but the former would 
require the adoption of best practices that mandate the inclu- 
sion of test cases for specifications. 
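
To make the idea of test cases for specifications concrete, the following is a minimal sketch under stated assumptions. It uses a sorting specification as a stand-in example; the names `weak_sort_spec`, `strong_sort_spec`, and `SPEC_TESTS` are hypothetical and are not taken from any of the surveyed systems. The weak specification contains a specification bug (it forgets that the output must be a permutation of the input), so an implementation that returns an empty list would still verify against it. A handful of test cases that pair inputs with outputs the specification should reject is enough to expose this.

```python
from collections import Counter
from typing import Callable, List, Tuple

Spec = Callable[[List[int], List[int]], bool]

def weak_sort_spec(inp: List[int], out: List[int]) -> bool:
    """Buggy spec (illustrative): output is ordered, but nothing ties it to the input."""
    return all(a <= b for a, b in zip(out, out[1:]))

def strong_sort_spec(inp: List[int], out: List[int]) -> bool:
    """Repaired spec: output is ordered and is a permutation of the input."""
    return weak_sort_spec(inp, out) and Counter(inp) == Counter(out)

# Test cases for the specification itself:
# (input, candidate output, should the specification accept it?)
SPEC_TESTS: List[Tuple[List[int], List[int], bool]] = [
    ([3, 1, 2], [1, 2, 3], True),
    ([3, 1, 2], [3, 1, 2], False),  # not ordered
    ([3, 1, 2], [], False),         # drops every element
    ([2, 2, 1], [1, 2], False),     # loses a duplicate
]

def failing_spec_tests(spec: Spec) -> List[Tuple[List[int], List[int]]]:
    """Return the (input, output) pairs on which the spec gives the wrong verdict."""
    return [(i, o) for i, o, expected in SPEC_TESTS if spec(i, o) != expected]
```

Running `failing_spec_tests(weak_sort_spec)` flags the two cases in which elements are dropped, whereas the repaired specification passes all four cases. The negative cases do the work here: a specification that is too permissive is only caught by outputs it ought to reject.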

Shim layer. As our study demonstrates, bugs in non- 
verified components, such as the shim layer, remain a seri- 
ous threat to the overall reliability of systems. Because these 
components are not covered by verification guarantees, a single
implementation bug could compromise the overall system
correctness. Furthermore, the shim layer is often not
reused across projects — all surveyed verified systems had 
custom-built shim layers. Therefore, existing shim layers are 
likely to contain undiscovered bugs. Building reusable and 
well-documented shim layers that are applicable to different
applications would help improve the reliability
of verified systems. In addition, formally specifying the
properties of the shim layer that the verified components expect
allows testing tools, as we showed with the PK toolchain,
to test the properties of the shim layer without having to
test the verified components. Isolating the unverified components
could significantly improve the scalability of testing,
as compared with testing unverified systems, by reducing the 
amount of code that needs to be tested and by ensuring the 
required properties are clearly defined. 
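
As an illustration of testing a shim layer in isolation against an explicitly stated property, consider the sketch below. It is not the PK toolchain and does not reproduce any surveyed system's code: the wire format and the functions `marshal`, `unmarshal`, and `check_round_trip` are hypothetical. The assumed property, which a verified component might rely on, is that unmarshaling a marshaled message returns the original message, and that malformed packets are rejected rather than silently accepted.

```python
import random
import struct
from typing import Tuple

# Hypothetical wire format: 8-byte sequence number, 4-byte payload length,
# both in network byte order, followed by the payload bytes.
HEADER = "!QI"
HEADER_SIZE = struct.calcsize(HEADER)

def marshal(seq: int, payload: bytes) -> bytes:
    """Encode a message as a length-prefixed packet."""
    return struct.pack(HEADER, seq, len(payload)) + payload

def unmarshal(packet: bytes) -> Tuple[int, bytes]:
    """Decode a packet; raise ValueError instead of guessing on malformed input."""
    if len(packet) < HEADER_SIZE:
        raise ValueError("packet shorter than header")
    seq, length = struct.unpack(HEADER, packet[:HEADER_SIZE])
    payload = packet[HEADER_SIZE:]
    if len(payload) != length:
        raise ValueError("payload length does not match header")
    return seq, payload

def check_round_trip(trials: int = 1000, seed: int = 0) -> int:
    """Randomly test the assumed property: unmarshal(marshal(m)) == m."""
    rng = random.Random(seed)
    for _ in range(trials):
        seq = rng.randrange(2 ** 64)
        payload = bytes(rng.getrandbits(8) for _ in range(rng.randrange(256)))
        assert unmarshal(marshal(seq, payload)) == (seq, payload)
    return trials
```

Because the property is stated over the shim functions alone, the test never has to execute the verified protocol code, which keeps the amount of code under test small.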

9. Related Work 

Much work has been done to analyze the correctness of un- 
verified systems [11, 13, 19, 37, 41, 42, 48, 52]. In stark con- 
trast, Yang et al. [57] conducted the only other study, to our 
knowledge, that analyzed the correctness of a formally verified
implementation. By testing 11 compilers, they found
more than 325 bugs, two of which were located in a veri- 
fied compiler (CompCert [35]). Like our study, their work 
concluded that verification, while effective in reducing com- 
piler bugs, does not replace testing — specifications are com- 
plex and seldom check end-to-end guarantees. In contrast to 
Yang et al.'s study, our study targets a different class of verified
implementations. Interestingly, we found examples of prob- 
lems in the verification tools themselves that their study did 
not uncover. 

The importance of distributed systems has prompted sig- 
nificant work on analyzing and improving their reliability. 
For instance, Yuan et al. [58] sampled and studied 198 bugs 
in five popular implementations of distributed systems. Their 
results showed that many serious bugs can be detected by 
testing the error-checking code. Guo et al. [21] studied cas- 
cading recovery failures that can bring down entire dis- 
tributed systems. In the context of minimizing execution 
traces, Scott et al. [49] found several bugs in an unverified 
implementation of the Raft protocol. 

The verification of distributed system protocols has been 
an important line of research with a long history. For in- 
stance, many protocol proposals provide a pen-and-paper 
proof: the RSM protocols (e.g., Paxos [32], Raft [47], and 
PBFT [10]) are notable examples. To prevent mistakes in 
pen-and-paper proofs [59], others have gone further and pro- 
posed machine-checked proofs [43]. As an alternative to 
proof-based methods, bounded model-checking techniques 
have been leveraged to increase confidence in the correct- 
ness of distributed systems [27, 55, 56, 59]. More recently, 
verified distributed system implementations, which we studied,
have been proposed to extend formal guarantees to actual
implementations [23, 36, 54].

10. Conclusion

This work presents the first comprehensive study on the cor- 
rectness of formally verified implementations of distributed 
systems. Our study found 16 bugs that were caused by a 
wide range of incorrect assumptions. We thoroughly analyzed
these bugs and their underlying causes, which suggest
that only a small fraction of the TCB was responsible
for these problems; hence, this subset should be the focus 
of special attention. Our analysis suggested that verification 
was effective at preventing protocol bugs that occur in un- 
verified systems. We conclude that verification, while bene- 
ficial, posits assumptions that must be tested, possibly with 
testing toolchains similar to the PK toolchain we developed.